The Relationship Between Income and Life Expectancy
Final Project
1 Data
1.1 Variables
1.2 Daily Mean Household Income Per Capita
GapMinder compiled data on daily mean household income per capita to analyze income distributions over hundreds of years. These figures are anchored in the official Mean Income indicator from the World Bank, derived from household surveys. For countries lacking World Bank data, GapMinder estimated mean income based on GDP per capita. The available time frame for actual World Bank data spans from 1967 to 2021, though most countries have limited data points within these years.
GapMinder used growth rates of constant-dollar GDP per capita to estimate mean incomes historically from 1800 and to project them up to 2100. For the period 1981-2019, they relied on World Bank data, known for its comprehensive coverage, published in the World Development Indicators as “Survey mean consumption or income per capita, total population (2017 PPP dollars per day).” The World Bank Poverty and Inequality Platform (PIP) describes the same indicator as “Survey mean/average consumption or income per capita, total population (2017 PPP dollars per day).” PIP further notes: “The mean represents the average monthly household per capita income or consumption expenditure from the survey in 2017 PPP.” Average consumption is unimportant to our project, but the World Bank collects it in tandem with income per capita.
In short, the average daily income is the mean daily household per capita income or consumption expenditure from the survey, expressed in 2017 constant international dollars.
1.2.1 Life Expectancy
GapMinder collects life expectancy data from various sources to create a comprehensive dataset spanning from 1800 to 2100. Life expectancy at birth refers to the average number of years a newborn is expected to live, assuming that current mortality rates remain constant throughout their lifetime.
For the period from 1800 to 1970, GapMinder relies on its own compiled data (version 7), which includes information from over 100 sources and accounts for historical events causing significant mortality dips. From 1950 to 2019, data is primarily sourced from the Global Burden of Disease Study 2019 by the Institute for Health Metrics and Evaluation (IHME). This source provides detailed annual estimates. For projections from 2020 to 2100, GapMinder uses forecasts from the United Nations’ World Population Prospects 2022. The data is carefully combined, prioritizing IHME data when available, and extending IHME series with UN estimates for future projections.
1.3 Hypothesized Relationship Between the Variables
We hypothesize that average daily income is positively associated with life expectancy at birth: countries with higher average daily income tend to have higher life expectancy.
1.4 How the Data was Cleaned
To clean the data, we inspected the data types and saw that every value was stored as a character string, even though the values themselves were numeric. To fix this, we mutated each year’s column to a numeric type.
The year names initially had an X in front of them when the data was first loaded. We removed this prefix after pivoting the data so that we could easily reference the years when graphing our data.
Rather than eliminating NA values at this stage, we left them in place so that, when joining the data, we could decide which years or countries to keep based on where the two data frames overlap.
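The type conversion described above can be sketched as follows. This is a minimal illustration on a hypothetical toy data frame (`income_raw` and its values are invented for the example, not taken from the GapMinder files), assuming the `dplyr` package was used for the mutation.

```r
library(dplyr)

# Hypothetical toy data: year columns arrive as character strings with an X prefix
income_raw <- data.frame(
  country = c("A", "B"),
  X2009 = c("1.5", NA),
  X2010 = c("2.0", "3.25"),
  stringsAsFactors = FALSE
)

# Convert every year column to numeric; NA values are kept, not dropped
income_clean <- income_raw %>%
  mutate(across(starts_with("X"), as.numeric))

str(income_clean)
```

Leaving the NA entries in place at this step matches the choice to defer decisions about missing data until the join.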
1.5 How the Data was Pivoted
Next, we pivoted the data by country to separate each year into individual observations. For each country and year, we now have the corresponding average daily income and average life expectancy.
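A pivot of this kind might look like the sketch below. The data frame `income_wide` and the column name `daily_income` are hypothetical, and the use of `tidyr::pivot_longer` is an assumption about the tooling; the X prefix is stripped after pivoting, as described in the cleaning section.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
income_wide <- data.frame(
  country = c("A", "B"),
  X2009 = c(1.5, NA),
  X2010 = c(2.0, 3.25)
)

# One row per country-year; then drop the X prefix and store year as an integer
income_long <- income_wide %>%
  pivot_longer(
    cols = starts_with("X"),
    names_to = "year",
    values_to = "daily_income"
  ) %>%
  mutate(year = as.integer(sub("^X", "", year)))

income_long
```

The same pattern would apply to the life expectancy table, with a different name passed to `values_to`.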
1.6 How the Data was Joined
To create one data table, we must join the two data sets that were cleaned and pivoted. One way to do this is through an inner join, which also handles any missing data by dropping country-year pairs that are absent from either data set.
In addition to joining the data, the name of the “country” column was capitalized in order to have uniformity among the variable names.
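As a rough sketch of the join, assuming both long tables share `country` and `year` key columns (the data frames and values below are hypothetical, and `dplyr` is an assumed choice of package):

```r
library(dplyr)

# Hypothetical long-format tables produced by the pivot step
income_long <- data.frame(
  country = c("A", "A", "B"),
  year = c(2009L, 2010L, 2010L),
  daily_income = c(1.5, 2.0, 3.25)
)
life_long <- data.frame(
  country = c("A", "B", "C"),
  year = c(2010L, 2010L, 2010L),
  life_expectancy = c(70.1, 75.3, 80.0)
)

# Keep only country-year pairs present in both tables, then capitalize the key name
joined <- inner_join(income_long, life_long, by = c("country", "year")) %>%
  rename(Country = country)

joined
```

Here the 2009 row for country A and the row for country C are dropped because each appears in only one table. Rows whose keys match but whose values are NA would still be kept by an inner join and would need a separate filter.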
2 Linear Regression
2.1 Exploring the Relationship Between the Two Variables
The variables to be explored are the average daily income and the average life expectancy. The relationship of interest is how income affects life expectancy.
The explanatory variable is the average income and the response variable is the average life expectancy.
To explore the relationship over time, visuals were constructed for each year:
The visual with all the years shows a logarithmic curve rather than a linear relationship: life expectancy at birth rises with average daily income, but the gains level off at higher incomes. Taking a look year by year, the average daily income also generally increases as time progresses. Because incomes are expressed in constant 2017 international dollars, this increase reflects real income growth rather than inflation.
2.2 Linear Regression
2.2.1 Steps to Choosing Regression Features
We simplified the linear regression by restricting it to the year 2010. Daily income and life expectancy have changed significantly over the centuries, which makes it challenging to capture the full extent of these trends in a single regression model.
Historical data from the 1800s to the present day illustrates substantial shifts in both daily income and life expectancy, reflecting changes in economic, social, and healthcare systems globally.
By selecting the year 2010 as a reference point, we aim to focus on a period that represents a modern snapshot of these trends. Here’s why 2010 is a good choice:
- Representative Modern Era: 2010 serves as a representative point in the modern era, offering insights into contemporary socioeconomic and health conditions across countries.
- Mitigation of Predicted Data: The decision to exclude years beyond 2010 accounts for the absence of actual data and instead focuses on observed trends. This approach prevents potential biases introduced by predicted data, particularly in later years beyond the data collection timeframe.
- Adequate Time for Analysis: With 14 years having passed since 2010, this timeframe provides sufficient data for analysis while minimizing the impact of short-term fluctuations that may occur within smaller time intervals.
By anchoring our analysis to the year 2010, we aim to capture meaningful trends in daily income and life expectancy while ensuring the reliability and relevance of our linear regression model.
2.2.2 Regression Code
Call:
lm(formula = life_expectancy_2010 ~ log(daily_income_2010), data = average_data_years)
Coefficients:
(Intercept) log(daily_income_2010)
53.567 6.802
The linear regression formula is \(\hat{y} = 53.567 + 6.802 \times \log(x)\), where \(x\) is the daily income in 2010, \(\hat{y}\) is the predicted life expectancy in 2010, and \(\log\) is the natural logarithm.
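The call shown in the output can be reproduced in outline as follows. This is a minimal sketch on synthetic data, not the project data: the values are generated from the reported coefficients plus normal noise with an arbitrarily assumed standard deviation of 5, and the income range is invented. Only the column names, the formula, and the count of 195 observations come from this report.

```r
set.seed(42)

# Synthetic stand-in for the 2010 data (195 rows, matching the reported count)
n <- 195
daily_income_2010 <- exp(runif(n, min = 0, max = log(100)))  # assumed income range
average_data_years <- data.frame(
  daily_income_2010 = daily_income_2010,
  life_expectancy_2010 = 53.567 + 6.802 * log(daily_income_2010) + rnorm(n, sd = 5)
)

# Same formula as the reported call: life expectancy on the natural log of income
model_2010 <- lm(life_expectancy_2010 ~ log(daily_income_2010),
                 data = average_data_years)
coef(model_2010)
```

Because the data are simulated around the reported coefficients, the fitted values should land near them but will not match exactly.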
2.2.3 Interpretation of coefficients:
Intercept (53.567): The intercept is the estimated life expectancy in 2010 when daily income is 1 dollar per day (2017 PPP), since log(1) is equal to 0. At that income, the model predicts an average life expectancy of 53.567 years.
Daily Income Coefficient (6.802): Because income enters the model on the natural-log scale, a 1 percent increase in daily income is associated with an increase in life expectancy of about 6.802 × log(1.01) ≈ 0.068 years, on average. Equivalently, doubling daily income is associated with an increase of about 6.802 × log(2) ≈ 4.7 years.
These interpretations provide insights into the relationship between daily income and life expectancy in the year 2010, as captured by the estimated regression model.
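A short worked calculation may help make the coefficients concrete. The sketch below uses only the two coefficients printed in the regression output; the function name `predict_life_expectancy` is hypothetical.

```r
# Coefficients as printed in the regression output
intercept <- 53.567
slope <- 6.802

# Predicted life expectancy (years) at a given daily income; log() is the natural log in R
predict_life_expectancy <- function(daily_income) {
  intercept + slope * log(daily_income)
}

at_one_dollar <- predict_life_expectancy(1)   # equals the intercept, since log(1) = 0
gain_one_percent <- slope * log(1.01)         # about 0.068 years
gain_doubling <- slope * log(2)               # about 4.71 years

c(at_one_dollar, gain_one_percent, gain_doubling)
```

In a model of this form the predicted gain depends only on the proportional change in income, so a move from 2 to 4 dollars per day yields the same predicted gain as a move from 20 to 40.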
2.3 Model Fit
| Total Variance | Fitted Variance | Residual Variance |
|---|---|---|
| 75.7859 | 49.26955 | 26.51635 |
The total variance in life expectancy is 75.79. Of this, 49.27 is explained by the model, which gives an \(R^2\) of 49.27 / 75.79 ≈ 65.01%. The remaining 26.52 of the total variance is unexplained.
Based on the \(R^2\) of 65.01%, the model quality is moderately good. While the log of daily income explains the majority of the variability in life expectancy, a substantial share of the variance (about 35%) remains unexplained.
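The arithmetic behind the \(R^2\) can be checked directly from the table. The sketch below uses only the three variances reported above; the variable names are illustrative.

```r
# Variances taken from the model fit table
total_var <- 75.7859
fitted_var <- 49.26955
resid_var <- 26.51635

# The fitted and residual variances add up to the total variance
fitted_var + resid_var

# R-squared is the share of the total variance explained by the model
r_squared <- fitted_var / total_var
round(r_squared, 4)
```

With a fitted `lm` object, the same quantities could be obtained as the variances of the fitted values and of the residuals.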
3 Simulated Data
3.1 Visual of Simulated Data
Both plots show the same general trend: a positive relationship between daily income and life expectancy. The data are denser at lower income levels, consistent with the majority of the population having relatively low incomes. Because the simulated data are based on the year 2010 alone, which has only 195 observations, the plot of simulated data shows more clustering at specific income levels, whereas the observed data appear more continuous.
3.2 Distribution of R-squared
In this plot, the simulated datasets have \(R^2\) values between 0.5 and 0.65, with the histogram peaking at approximately 0.575. Most of the simulated models therefore explain between 55% and 60% of the variability in the data. Because the simulated \(R^2\) values center on roughly 0.575, not far from our observed \(R^2\) of 0.6501, the statistical model used for the simulation appears relatively effective. Only a small portion of the simulated \(R^2\) values reach as high as 0.65, so while most of the simulated models underperform compared to our observed model, some come close to the observed level of explanatory power.
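The report does not state how the simulated datasets were generated, so the following is only one plausible sketch, under loudly stated assumptions. Each simulated response is the fitted value plus normal noise with the model's residual standard error, and each \(R^2\) comes from regressing the observed values on the simulated ones. The data are synthetic stand-ins, and `simulate_r2` is a hypothetical helper.

```r
set.seed(2010)

# Synthetic stand-in for the 2010 data (assumed noise level and income range)
n <- 195
log_income <- runif(n, min = 0, max = log(100))
life_obs <- 53.567 + 6.802 * log_income + rnorm(n, sd = 5)
fit <- lm(life_obs ~ log_income)

# One simulated dataset: fitted values plus noise drawn at the residual standard error,
# then the R-squared from regressing the observed response on the simulated response
simulate_r2 <- function(fit, observed) {
  simulated <- fitted(fit) + rnorm(length(observed), mean = 0, sd = sigma(fit))
  summary(lm(observed ~ simulated))$r.squared
}

r2_sims <- replicate(1000, simulate_r2(fit, life_obs))
summary(r2_sims)  # hist(r2_sims) would draw the corresponding histogram
```

Under this scheme the simulated \(R^2\) values tend to fall below the observed \(R^2\), because the simulated response carries extra noise.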